
[Feature] Support batched vector index training with sampling ratio #8350

Draft
jerry-024 wants to merge 5 commits into apache:master from jerry-024:fix_paimon_vector_index_train_limit
jerry-024 (Contributor) commented on Jun 25, 2026

Purpose

This PR updates native vector global index building to train through the VectorIndexTrainer / VectorIndexTraining API instead of materializing all training vectors in one Java float[].

Main changes:

  • Bump paimon-vector-index-java to 0.2.0-SNAPSHOT for the separated trainer/training API.
  • Add train.sample-ratio for vector index training, with default 1.0 so the default path still trains with all non-null vectors.
  • Support both index-level and field-level configuration, for example <index-type>.train.sample-ratio and fields.<field-name>.train.sample-ratio; field-level configuration takes precedence.
  • When train.sample-ratio is less than 1.0, sample training vectors evenly from the temporary vector file and feed them to the native trainer in batches, while still adding all vectors to the final index.
  • Remove the previous single large Java training-array allocation, and cap both training batches and add batches so that neither allocates an oversized Java array.
  • Log the selected training sample count and large native-memory estimates.
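
For reviewers, the configuration precedence and the even, batched sampling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, the `ivf` index type and `emb` field used below are made-up examples, the exact option-key prefixes may differ, and the `(0, 1]` validation range and the `maxFloats` cap are assumptions. The sketch only plans which row positions go into each training batch; the real builder would read those vectors from the temporary vector file and pass each batch to the native trainer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of sample-ratio resolution and even, batched sampling. */
public class TrainSampleSketch {

    /** Field-level key wins over the index-level key; the default is 1.0 (train on all vectors). */
    public static double resolveSampleRatio(
            Map<String, String> options, String indexType, String fieldName) {
        String fieldKey = "fields." + fieldName + ".train.sample-ratio";
        String indexKey = indexType + ".train.sample-ratio";
        String raw = options.containsKey(fieldKey) ? options.get(fieldKey) : options.get(indexKey);
        double ratio = raw == null ? 1.0 : Double.parseDouble(raw);
        // Assumption: only ratios in (0, 1] are meaningful.
        if (!(ratio > 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException("train.sample-ratio must be in (0, 1], got " + ratio);
        }
        return ratio;
    }

    /** Number of training vectors for a given total; at least 1 when total > 0. */
    public static long sampleCount(long total, double ratio) {
        if (total <= 0) {
            return 0;
        }
        return Math.min(total, Math.max(1L, (long) Math.ceil(total * ratio)));
    }

    /** Row position of the i-th of k evenly spaced samples; assumes i * total fits in a long. */
    public static long samplePosition(long i, long k, long total) {
        return i * total / k; // strictly increasing because total >= k
    }

    /** Largest vector count per batch such that one float[] holds at most maxFloats elements. */
    public static int vectorsPerBatch(int dim, int maxFloats) {
        return Math.max(1, maxFloats / dim);
    }

    /** Groups the sampled row positions into size-capped batches. */
    public static List<long[]> planBatches(long total, double ratio, int dim, int maxFloats) {
        long k = sampleCount(total, ratio);
        int perBatch = vectorsPerBatch(dim, maxFloats);
        List<long[]> batches = new ArrayList<>();
        for (long start = 0; start < k; start += perBatch) {
            int n = (int) Math.min(perBatch, k - start);
            long[] rows = new long[n];
            for (int j = 0; j < n; j++) {
                rows[j] = samplePosition(start + j, k, total);
            }
            batches.add(rows);
        }
        return batches;
    }
}
```

With 10 vectors, a ratio of 0.5, dimension 4, and a cap of 8 floats per array, this plan selects rows 0, 2, 4, 6, 8 and splits them into batches of at most two vectors. The add phase is unaffected by the ratio: every vector still goes into the final index.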

Tests

jerry-024 marked this pull request as draft on June 25, 2026 02:50
jerry-024 changed the title from "[feature] support batched paimon vector index training" to "[Feature] Support sampled batched vector index training" on Jun 30, 2026
jerry-024 changed the title from "[Feature] Support sampled batched vector index training" to "[Feature] Support batched vector index training with sampling ratio" on Jun 30, 2026